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Abstract 

Gaussian Mixture Models (GMM) have found many applications in density 
estimation and data clustering. However, the model does not adapt well to 
curved and strongly nonlinear data. Recently there appeared an improve¬ 
ment called AcaGMM (Active curve axis Gaussian Mixture Model), which 
fits Gaussians along curves using an EM-like (Expectation Maximization) 
approach. 

Using the ideas standing behind AcaGMM, we build an alternative active 
function model of clustering, which has some advantages over AcaGMM. In 
particular it is naturally defined in arbitrary dimensions and enables an easy 
adaptation to clustering of complicated datasets along the predefined family 
of functions. Moreover, it does not need external methods to determine the 
number of clusters as it automatically reduces the number of groups on-line. 

Keywords: clustering, Gaussian Mixture Models, Expectation 
Maximization, Gross-Entropy Glustering, Active curve axis Gaussian 
Mixture Model. 


1. Introduction 

Glustering plays a basic role in many parts of data engineering, pattern 
recognition and image analysis PEI El ISIS]. One of the most important is 
Gaussian Mixture Models PIZlEli. It is hard to overestimate the role of 
GMM in computer science PIZIEIIS], including object detection dDuniiia 
Emil Eg, object tracking ESEZ], learning and modelling EHIEI], feature 
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selection [HI EOj , classification [211 [22] or statistical background subtraction 

|23l[21ll25|. 

GMM accommodates data of varied structure, e.g. the component dis¬ 
tributions can concentrate around surfaces of lower dimension obtained by 
principal components (PCA) |2S|- However, it often happens that clusters 
are concentrated around lower dimensional manifolds which are not linear. 
Since one non-Gaussian component can often be approximated by several 
Gaussian ones EH, these clusters are in practice represented by introducing 
more Gaussian components which can be seen as a form of piecewise lin¬ 
ear approximation, see Fig. Due to the intrinsic linearity of the Gaussian 
model, when there are nonlinear manifolds in the data cloud, it is natural that 
many components are required and the fitting error is large. Gonsequently, 
the constructed model does not reflect optimally the internal structure of the 
data. A similar result gives Gross Entropy Glustering approach, compare Fig 
|2(a)| and |2(b)[ 



(a) Level set for classical 
Gaussian density. 



(b) Level set of AcaGMM 
Gaussian model. 


Figure 1: Gomparison of level-sets generated by classical Gaussian density and AcaGMM 
model. 

There are several methods attempting to solve the problem of fitting 
nonlinear manifolds, e.g. principal curves and principal surfaces [2Ell29ll30|. 
Principal curves/surfaces algorithms are typically capable of expressing a 
single complex manifold. In EH the authors present an adaptation of the 
Gaussian Mixture Model called Active curve axis Gaussian Mixture Models 
(AcaGMM), which uses a nonlinear curved Gaussian probability model in 
clustering. In its basic version it works with data on the plane and adapts 
to the quadratic curves. In other words AcaGMM uses a wider class then 
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typical Gaussians - namely Gaussians which are curved over parabolas. 

Since our paper aims at solving the same task as AcaGMM, let us first 
explain the method and present the typical steps behind it. First, using an 
additional tool, the authors find the “right” number of clusters (one of the 
possible methods is given in [H2], however, one can also use [HH])- Then for 
each cluster the PGA algorithm is applied to determine the reasonable basis, 
and a Gaussian curved along the optimal parabola is used. The coordinate 
system is nonlinear, see Fig. 2(c) (the y coordinate is chosen as a distance 
from the parabola, and x is the length on the parabola from the projected 
point to the parabola’s vertex). AcaGMM has found applications in par¬ 
ticularly in human hand motion recognition |Hlj. It can also be fuzzified 

ESI. 



Figure 2: Fitting a b-type set by using (a) GMM, (b) CEC, (c) AcaGMM, (d) afCEC. 

AcaGMM works well in practice, however, it has some limitations. The 
model is naturally restricted to quadratic functions as the nonlinear coordi¬ 
nate system requires the projection onto the graph and length of the curve. 
The use of the method in higher dimensional case, although possible, is 
practically rather limited. Moreover, AcaGMM is not a theoretically based 
density model (see Appendix for the detailed explanation), and therefore it 
is not in fact formally EM based, but only uses its optimization algorithm. 
Gonsequently, contrary to the classical EM EniEZ], the MLE cost function 
does not necessarily decrease with iterations. Let us recall that in general 
EM aims at finding pi,... ,pk >0, fi, ■ ■ ■, fk Gaussian den¬ 

sities (where k is given beforehand and denotes the number of densities which 
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convex combination builds the desired density model) such that the convex 
combination 

/ := Pifi + • • -Pkfk 

optimally approximates the scatter of our data X = {xi, ..., Xn} with respect 
to MLE cost function 

n 

MLE{f,X) := - J]ln(pi/i(x/) + .. .+Pnfn{xi)). (1.1) 

1=1 

The EM procedure consists of the Expectation and Maximization steps. 
While the Expectation step is relatively simple, the Maximization usually 
needs complicated numerical optimization even for relatively simple Gaus¬ 
sian models [aHiEaiio]. 



Figure 3: Result of afCEC algorithm in the case of a 3D shark-type set. 


In this paper we propose the afCEC method which is based on the CEC 
model, instead of the Expectation Maximization (EM) and Gaussian density 
model in a curvilinear coordinate system. A goal of GEG is to minimize 
the cost function, which is a minor modihcation of that given in (1.1) by 
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substituting sum with maximum: 


CEC(/,X) := - ... ,pnfn{xi))). (1.2) 

1=1 


Instead of focusing on the density estimation as its main task, CEC aims 
itself directly to the clustering problem. It occurs that at the small cost 
of minimally worse density approximation |33] we gain speed in implemen- 
tatiorj^ and the ease of using more complicated density models. Roughly 
speaking, the advantage is obtained because models do not mix with each 
other, since we take the maximum instead of sum. 

Consequently, we are able to construct an algorithm which is easy to 
adapt to the higher dimensional case. The results of afCEC and AcaGMM 
are similar on the plane, compare Fig. |2(c) and Fig. |2(d) , The effect of our 
algorithm in on a shark-type set im 02] is shown in Fig. [s} 

The afCEC method is able to reduce unnecessary clusters. In Fig. 
we present a convergence process of afCEC with initial number of clusters 
k = 10, which is reduced to A: = 5. 
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Figure 4: A convergence process of afCEC on a Chinese character with initial k = 10, 
which is reduced to k = b. 

This paper is arranged as follows. In the next section the theoretical 
background of the density model will be presented. Since AcaGMM works in 
only for parabolas we start with a similar situation. Then we describe a 
general model for data in In the third chapter we present the theoretical 


^We can often use the Hartigan approach to clustering which is faster and typically 
finds better minima. 
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background of the afCEC method. In particular, we prove that the cost func¬ 
tion decreases in every iteration, see Theorem |3.2[ The last chapter presents 
numerical experiments. In appendix we include details of the description of 
the AcaGMM model. 


2. /-adapted Gaussian density 

In this section, the /-adapted Gaussian distribution, where / G M) 

is a continuous function, will be presented. The goal of this approach is to 
transform a normal distribution (which assumes the intrinsic linearity of the 
model) to the case of curves (or more generally to manifolds), which are given 
by the graph of the function /. The above model will be used in the afGEG 
method. 


2.1. Toy example in 


Since AcaGMM works in the two-dimensional case (in higher dimensional 
ones the authors use PGA to reduce problems to 2D) with parabolas {f{x) = 
ax'^ -|- 6 for a, 6 G E), we start from comparison AcaGMM and our model in 
such a case. Let f{x) = oa:^ -|- 6 for o, 6 G E be given. The two dimensional 


Gaussian density for m^ = [mi,m 2 ] and covariance matrix S 
given by the following formula 


0-1 

0 


0 

(X2 


is 


A^(m, E)(x) = N{mi,al){xi) • N{m2, o-2)(^2), ( 2 . 1 ) 


where in the one dimensional case we have 


A'(m, cr^)(x) 


,— exp 


X — m 
2(t2 


for m, cr G E. 


Let X = [xi, 0 : 2 ]^ G E^ be given. The AcaGMM approach uses the orthog¬ 
onal projection of the point x onto the parabola / which is denoted by p/(x) 
and the arc length between p/(x) and m which is denoted by //(p/(x),m). 
Gonsequently the AcaGMM function is given by 


Ar(m,S,/)(x) 




2al ) 




V 2^1 } • 


( 2 . 2 ) 


This approach is very intuitive but it causes two basic problems. It is very 
hard (or even impossible) to give explicit formulas for orthogonal projection 


6 











and arc length for more complicated curves in higher dimensional spaces. Cal¬ 
culations are complicated (from the numerical point of view), consequently 
the field of possible generalizations of AcaGMM is limited. Moreover, the 
function which was used in AcaGMM, see formula (2.2), is not a density. The 
Jacobian of the respective transformation was not included (see Appendix). 

In our paper we use a simpler approach, which is based on the Euclidean 
norm and the following formula for the density function /: 

A^(m, S, f){[xi, X 2 \) = N{mi, (Tj){xi) ■ N{m 2 , (tI){x 2 - f{xi)). (2.3) 


Since we do not use orthogonal projection and arc length, it is easy to cal¬ 
culate the parameters of our generalized Gaussian distribution, see Fig. 




(a) (b) 


Figure 5: Density level-sets generated by the /-adapted Gaussian model. 

The practical difference in between AcaGMM and our approach is 
quite smalQ see Fig. Nevertheless, our model is more flexible, as we can 
use an arbitrary class of functions for which least squares methods work. 

2.2. f-adapted Gaussian density 

In this subsection, the general notion of /-adapted Gaussian will be pre¬ 
sented. Let us recall that the standard Gaussian density in is defined 
by 


^In our case we use the parabola ax‘^ + + c instead ax‘^ + c since our method does 

not apply the change of coordinates given by PC A. 
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(a) The AcaGMM method. 





(b) The afCEC method. 


Figure 6: Ellipses generated by AcaGMM and afCEC. 


where m denotes the mean, E is the covariance matrix and ||n||| 
is the square of the Mahalanobis norm. 

In our work we use a multidimensional Gaussian density in a curvilin¬ 
ear coordinate system which is spread along the function /: M (/- 

adapted Gaussian density). We treat one of the variables (for simplicity, 
the last one) separately. In such a case we consider only those E G Add(M) 
(where Add(lR) denotes the set of d-dimensional square matrices) which have 
the diagonal block matrix form 


E 


s.- 0 

0 Ed ’ 


where Ej G Wld-i(lR) and E^ G M. For x = (xi,... ,Xn) G and k = 
1,..., n we will use the notation 

x^ := (xi,..., Xk-i, Xk+i, ...,Xd)e 

For X C E'^, we denote Xf, := (x^: x G X}, the set containing vectors 
from X with removed k coordinate, and Xk := x G X}. For a function 


8 











Figure 7; Level sets for /-adapted Gaussian distribution. 


f: ^ ^ M, we denote 

■= X G X}. 

Definition 2.1. Let f G G Ald_i(E), E^ G E, m G E^ be 

given. The f-adapted Gaussian density for E^, E^ and m is defined as follows 

N{m, E,-, E,, /)(x) = iVK-, E,-)(x,-) • 7V(m,, E,)(a;, - /(x,-)) (2.4) 


Level sets for /-adapted Gaussian distributions with different types of 
functions are presented in Fig. 

Observation 2.1. The f-adapted Gaussian function N{m, E^, E^, /)(x), where 
f G C(E‘^“\E/ E^ G A4d_i(E), Erf G E, m G E^ is a density. 

Proof. Let A^(m, E) be a d-dimensional Gaussian density such, that E = 

, where E^ G A4d_i(E), E^ G E, m G E^. 

Let us consider a substitution 


0 

0 Ed 


iVi, • • • ,1/d) = {xi,.. .,Xd-i,Xd - fi^d))- 
In such a case, the Jacobian is equal to 


J{xi, ...,Xd) = det 


■ 1 

0 

0 

O' 

0 

1 

0 

0 

0 

0 

1 

0 




1 

_ dxi 

dx2 

... 

dxd-1 


Consequently, A^(m, E^, E^,/)(x) is a density. 


1 . 


□ 
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We will use the family of all h-dimensional Gaussian densities 
Moreover, for /: —>■ E, we will consider family of /-adapted Gaussian 

functions 


Af{W^ \E) := f): G /dd_i(E),m G E'^ and G E} . 

For the family C C(E‘^“^,E), we define 

M^(E‘^-\E)= |J{M/(E‘^“\E)}. 

feT 

We show that if contains all linear transformations, then ^(E*^) C ! 

Let us start with simple Lemma. 

Lemma 2.1. Let m G E*^, G Ald-i(E), G E and v G E*^”^ be given. 
\id-i 0 ■ 


Then for A = 


-1 
N ( Am, A 


we have 

where f: E*^”^ —>■ E such, that /(x) — v'^ ■ x. 

and m^ = [m^, m^], then we have 


Proof. Let us denote E = 
A^(^m,^EA'^)(x) = N 

= N 

= N 

It is easy to show that 

(AEA^)-^ 


0 


0 E 


/d-i 0 
V* 


d—1 

^ 1 


mj 


md 


Id-1 0 

v^ -1 


0 

0 E. 


Id-1 0 

v^ -1 




(x) 


^d 


E 

1 

a 

j 

T' 

^d 



y mj - md_ 

j 


! 0 
E. 


Id-1 V 
0 -1 




(x) 


v^E.~ v^E.-v-FE. 


X 


A 


n -1 


v^Ej v^EjV Ed 


0] 


r T 
vv 

—V 


0] 

1 

—V 

d 

. 0 0. 


T 

— V 

1 

— 

d 

_ 0 0_ 


1 
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Therefore we have 


[xix,]iAEA^) 


T\-l 




Xd 




EtI o' 

d 

0 0 




-1 


[-v^ 1] 


-1 


Xj 


Xd 


= ^d^d ^d+i^d-^ay)^d (a^d-Xrf'V ) = [x^-,a;d-v xj] 

As a simple consequence we obtain the assertion of the Lemma 



1 

0 

-1 

1 

1 

1 _ 

1 

0 

EdJ 


1 

> 

1 


□ 


Now we show that /-adapted Gaussian densities are an extension of the 
classical Gaussian model. 

Theorem 2.1. Let J- = {f: M: /(x) = • x for v G be the 

family of all linear transformations from into M. Then 

= g{R^). 

Proof. To prove the assertion, we first show the following inclusion: 


Let m G S = 


0 

0 Ew 


(where Ej G Md-i, E^ G M), v G 


i-l 


and 


/(x) = v^ • X be given and let A'(m, E^, E^,/) G ^,R)- Thanks to 

Lemma 2.1 for A = , we have 

V —1 

iV(m, E^-, Td, f) = N{Am, ATA^) G e?(M). 

We now show the opposite inclusion 

Let E = r ^ Ald(IR) and m G be given and let N(m, E) G 

V L 22 J 

We put Ej = Ell, Ed = — v^E //v + E 22 , /(x) = E/|^v^x and 

A=r / 0- 


v^E// -1 


iV[A-VE,~,Ed,/] =N 


. Thanks to Lemma 

/ 


m. 


2.1 

0 


v^E// -1 


we have 

Ell 0 

0 -v^E//v + E22 


I E/iV 
0 -1 
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= A/^ ( m, 


Sn 

,rT 


- E22 


I Er/v' 

0 -1 


= ( m, 


Sn 



Consequently 


N (m, E) = iV E,-, E,, /) e 

what finished the proof. □ 

The following observation is a corollary of Theorem in 

Corollary 2.1. Let T C contains the family of all linear trans¬ 

formations from into M. Then 


Consequently afCEC is a natural extension of the classical CEC algo¬ 
rithm. If we consider T containing only linear transformations, we obtain 
exactly the CEC algorithm. On the other hand, for wider classes of functions 
we detect more general clusters, which describe groups concentrated around 
manifolds which are not necessarily linear. 


3. Theoretical background of afCEC 

In this section the theoretical background of afCEC will be presented. 
First, we introduce the cost function which will be minimized by the algo¬ 
rithm. Then we prove that the optimal function which describes each cluster 
can be obtained by least square regression [l3j. We will end by describing 
the full algorithm of afCEC. 

Our method is based on the CEC approach. Therefore, we start with a 
short introduction to the method (for a more detailed explanation we refer the 
reader to |H3])- To explain CEC we need to introduce the cost function which 
we want to minimize. In the case of splitting oi X C into Xi,..., X^ such 
that elements of W we code by function from family of all Gaussian densities 
the mean code-length of a randomly chosen element x equals 

k 

E{Xi ,.. .,Xk;g{M.^)) ■.= Y,Pi- i-HPi) + (3.1) 

1=1 

where p* = . The formula uses cross-entropy of a data set with respect to 

the family 
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The aim of CEC is to find splitting of into sets Xj which minimize the 
function given in (3.1). Our goal is to calculate an explicit formula for the 
cost function in the case of /-adapted Gaussian densities. 


3.1. Cost function of one cluster 

In this section we will focus on the situation of one cluster X. In such a 
case we usually understand the data as a realization of a random variable. 
Consequently, as an estimator for the mean and covariance, we use 

mean(X) := 


V- 

n 

xex 


cov(X) := — — mean(X))(x — mean(X))^. 

^ xev 

As it was said, CEC uses cross-entropy of data set X with respect to the 
Gaussian family ^(E^). 

Theorem 3.1. Let X CW’’ be given. Then 

i/x(X||^l(E‘')) = inf H^{X\\g) = ^ ln(27re) + ^ ln(det(E)), 
qeG(RL 2 2 


where E = cov(X). 

The CEC algorithm will be used for a family of /-adapted Gaussian 
densities. In such a case the cost function is described by the following 
theorem. 

Theorem 3.2. Let X C E'^ and a function f G C(E‘^“\E) he given. Then 
H^{X\\Af{M.^~\R)) = ^ln(27re)+^ln(det(E^'))+iln ( ^ 

\ xev 


where E^ = cov(X^) and ma = mean(Xfc). 

Proof. LetX(m, Ej, Ed,/)(x) G E), where E^- G AId_i(E), E^ G 


m G E*^ and E = 


0 

0 E. 


. The assertion of the proposition is a simple 
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consequence of 

H-‘{X\\N(m,Si,S,,f)) = -iEx«li'(Af(m,S,-,S,./)(x)) = 
= - W E In ■ iV(mj. Ej)(x, - = 


xGX 


= -]k\ E (In (Nim^, S^')(x^')) + In iN{md, ^d){xd - /(x<|)))) = 


xGX 


= -^ E In {N{m^, S^-)(xj)) - ^ E In (A^(md, Ed)(a;d - /(xj))) = 

xGy xGX 

We can use Theorem |3.1| for both summands separately: 

H>'(xiM/(r-MR)) = H>‘(xja(R''->)) + HX( 4 lie(R)) = 

= ^ ln(27re) + 1 In (det (cov(Xj))) + 1 ln(27re) + pn ( i J] (xj - /(xj) - mj)'- 

\ xex 

□ 

As a corollary from the above theorem, we obtain that the optimal from 
the cross-entropy point of view function which describes a cluster can be 
obtained by a least squares method jl3] . 

Observation 3.1. Let X C be a data set and a family of functions 
T C he given. Then 


argmin ^ (X11.4,/ (M' 

feT 


\R^)) = argmin | kd - /(x^-) 


rrid 


where rud = mean(Xd). 

Consequently, we minimize cross-entropy by finding a least squares esti¬ 
mation. Moreover, if W is a set of function which are invariant under the 
operations / ^ a -|- / for any a, it is enough to find 

argmin I Xd - /(xj)p. 

Corollary 3.1. Let X C be a data set, and let a family of functions 
J- C be invariant under the operations f ^ a + f for a G E. Let 

f £ X be such that f = argmin |xd — /(x^)p. Then 

/6.F 


minWX(XM/(E 


d—1 


^)) = ^ ln(27re) + ^ ln(det(S^x)) + ^ In - /(x^-)) 


xGX 
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where = cov(Xj). 

The above theorem guarantees that the cost function is decreasing during 
iterations. The analogue of this result does not hold for AcaGMM (PCA is 
used for hnding a local coordinate system). Consequently, in afCEC (con¬ 
trary to AcaGMM) we are able to construct a simple stop condition. 


3.2. Coordinate system in afCEC model 

In the previous subsection there was shown how to determine optimal 
parameters for one cluster in arbitrarily given coordinate system. Now we 
describe how to fit the optimal one for the afCEC method. 

In AcaGMM the PCA (Principal Component Analysis) was used for find¬ 
ing a locally adapted coordinate system. Unfortunately, this operation causes 
problems with convergence (it is hard to construct a reasonable stop condi¬ 
tion). More precisely, by using PCA we do not minimize a cost function 
which is connected with least squares estimation. By applying two methods 
(PCA and regression) separately we do not minimize any of them. 

In the case of afCEC all computation use the canonical basis. We need 
only to decide which coordinate is chosen as dependent (then the rest becomes 
automatically explanatory). 

Our intuition to verify all possible coordinates in the canonical basis came 
from the Implicit Function Theorem [Hj. More precisely, under reasonable 
assumptions, for an implicit function T'(x) = 0 where F: —?• M and an 

arbitrary zero x G of F, we can find k G {1,..., d} and / : —>■ M 

such that locally in the neighborhood of x 

(x G F(x, r) : F(x) = 0} 

= {(xi,... ,a:fc_i,/(x^),Xfc+i,... ,a:d) : x^ G F(x^,r) C 

where F(x, r) is a ball with center x and radius r. 

Consequently for data A C E'^ we search for fc = 1,..., d and / such that 
X can be optimally approximated by the set 


{xi,..., Xk-i, /(x^), Xfc+i,..., a:^) for X G A. 


Example 3.1. Let us consider a c-type set, see Fig. When using the 
canonieal basis ofM?, we have to consider two possible estimated eurves (in 
our ease parabolas). We can treat x as a dependent variable, see Fig. 8(a)[ or 


we choose y as a dependent one, see Fig. 8(b)\ If we assume that dependent 
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(a) The c-type set and parabola fitted with 
assumption that x is the dependent vari¬ 
able. 



(b) The c-type set and parabola fitted with 
assumption that y is the dependent vari¬ 
able. 


Figure 8: Estimation of f-adaptive Gaussian density in two different coordinates. 

variable is x, we obtain the parabola x = — 1.4602t/ -|- 0.4078, and 

the sum of squared errors is equal to 1.420948. On the other hand, if y is 
the dependent coordinate, we have y = 0.8a:^ — 0.3756a: -|- 0.4997 with squared 
errors 53.35997. Consequently, the optimal coordinate system is describe by 
using x as the dependent variable. 

In the above example, we consider only bnt in higher dimensional 
spaces we have to consider d different possible choices of dependent variable: 
{Xf,,Xk) hv k e 1,... ,d. 

In conclusion, for one cluster X C we can estimate parameters of 
the model in two steps. First, we consider all possible choices of dependent 
variable: functions fk (corresponding with relations Xk = /(x^.)), means = 
mean(X^ ), m^ = mean(X^) and covariances = cov(X^), E^ = cov{X[ ) 
for k = 1,... ,d. Then we determine the optimal dependent variable 

j = argminlTT" (X|| Al([m^, E^, E^,/a,)) } . 

k=l,...,d 

Consequently, our data set is represented by the function, mean and covari- 
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ance matrix 

C. 01 

0 E,J 

where subscript j G {1,..., c?} denotes the dependent variable in cluster. 

The full algorithm can now be described. We use an adapted Lloyd’s 
method which is based on the simultaneous application of two steps. First, 
we construct a new division of X by matching each element x G X to a group 
such that the cost function is minimal. Then, we estimate new parameters 
in each cluster by applying the method presented in previous subsection, see 
Algorithm 

4. Experiments and analysis 

In this section we present a comparison of the afCEC method with AcaGMM, 
GMM and GEG. Since AcaGMM is not a density model, the Log-likelihood 
function is not well-defined. Nevertheless, by the input the Jacobian of 
AcaGMM transformation, we obtain a valid probability distribution, see Ap¬ 
pendix. To compare the results we use the standard Bayesian Information 
Griterion (BIG) 

BIC = -2LL + k\og{n) 
and Akaike Information Griterion (AIG) 

AC I = -2LL + 2k, 

where A: is a number of parameters in the model, n is a number of points, and 
LL is a maximized value of the Log-likelihood function. Consequently, we 
need a number of parameters which are used in each model. In a case of M^, 
AcaGMM uses two scalars for mean, three scalars for covariance matrix, two 
scalars for parabola and one for local coordinate system (obtained by PGA). 
On the other hand, in afCEC we do not need scalar for the local coordinate 
system. Consequently, afCEC uses two scalars for mean, three scalars for 
covariance matrix and two scalars for parabolsj^ 

Let us start from a synthetic data set. First, we report the results of 
afCEC, AcaGMM, CEC and GMM in the case of a circle-type set, see Fig.|^ 


^It should be emphasized that in afCEC we need to remember which coordinate is the 
dependent one. This parameter is discrete so we do not consider it in our investigation. 


f = fj m=[m-.,0], E = 
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Algorithm 1 afCEC : 

Input 

number of clusters /c > 0 
curve family 
stop condition e > 0 
dataset X (d - dimension of data) 
initial conditions 

obtain initial clustering Xi,... ,Xk 
obtain probabilities pi = ^ for i = 1,... ,k 

obtain parameters of each cluster /j, mean m^ and covariances Ep Ej in 
each cluster (choosing the best orientation) 
obtain cost function 

ho = Ell Pii-HPi) + H" (^ill 0]^, Ep E^))) 

repeat 

n = 0 

obtain new clustering Ai,..., by matching elements to the cluster 
such that (- In(pj) - ln(A/'([m., 0]^, Ep E*))) is minimal 

delete unnecessary clusters (|Aj| < 1% • |A|) by adding elements to 
the closest existing one 
update parameter k 
n = n + 1 

obtain new probabilities Pi = ® for i = 1,..., fc 
obtain new parameters of each cluster fi, mean m~- and covariances 
Ep Ej in each cluster (choosing the best orientation) 
obtain new cost function 

hn = EllPii-HP^) + HHX^\\N{[rapOf,^,l:,))) 
until hn > hn-i — s 
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(a) afCEC (b) AcaGMM (c) GMM. 


(d) GEG 





(e) Evolution of Log- 
likelihood function when 
the number of clusters 
increases from 1 to 10. 


(f) Evolution of Log- 
likelihood function when 
the number of parameters 
increases from 1 to 80. 


(g) Evolution of BIG 
function when the num¬ 
ber of cluster increases 
from 1 to 10. 


Eigure 9: Results of afGEG, AcaGMM, GEG and GMM in the case of circle-type set. 
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Fig. |9(e)| shows how the Log-likelihood function changes when the number 
of clusters increases from 1 to 10. Similar relation, in respect to number of 
parameter^ is presented in Fig. |9(f)| For a similar values of Log-likelihood 
function, we need 2 clusters in afCEC and AcaGMM and 4 in GMM and 
GEG, see Fig. In such a case, the BIG criterion shows that algorithms 
which use curved densities model better fit data with using smaller number 
of parameters. 



(a) AfCEC 




(b) AcaGMM (c) GMM 



(d) GEG 



(e) Evolution of Log- 
likelihood function when 
the number of clusters 
increases from 1 to 15. 



(f) Evolution of Log- 
likelihood function when 
the number of parameters 
increases from 1 to 120. 



(g) Evolution of BIG 
function when the num¬ 
ber of cluster increases 
from 1 to 10. 


Eigure 10: Results of afGEG, AcaGMM, GEG and GMM in the case of spiral-type set. 


Similar situation can be observed in a more complex case of spiral-type 
set, see Fig. In Table [T} the mean and maximum value of Log likelihood 
for 100 initializations of algorithms are shown. As we see for a similar values 
of Log-likelihood function, we have to use 9 clusters for afGEG and AcaGMM 


^Plots which present relation between Log-likelihood functions and number of param¬ 
eters was constructed by linear approximation of known values of the function. 
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NP 

afCEC 
mean LL 

max LL 

NP 

AcaGMM 
mean LL max LL 

NP 

GMM 
mean LL 

max LL 

NP 

CEC 
mean LL 

max LL 

1 

7 

-6178,30 

-6178,30 

8 

-6180,86 

-6180,86 

6 

-6180,68 

-6180,68 

6 

-6180,68 

-6180,68 

2 

14 

-6153,61 

-6069,15 

16 

-6182,41 

-6104,58 

12 

-6172,16 

-6157,13 

12 

-6170,67 

-6127,52 

3 

21 

-6109,99 

-6012,19 

24 

-6174,88 

-6068,96 

18 

-6173,61 

-6128,87 

18 

-6139,47 

-6088,73 

4 

28 

-6070,96 

-5924,87 

32 

-6127,14 

-5987,66 

24 

-6165,16 

-6062,66 

24 

-6102,95 

-6041,99 

5 

35 

-6006,17 

-5868,56 

40 

-6051,35 

-5836,19 

30 

-6131,26 

-6026,12 

30 

-6066,26 

-5989,85 

6 

42 

-5952,44 

-5713,61 

48 

-5972,21 

-5667,10 

36 

-6093,05 

-5990,69 

36 

-6028,42 

-5953,28 

7 

49 

-5905,57 

-5675,39 

56 

-5848,57 

-5558,34 

42 

-6031,69 

-5930,99 

42 

-5987,86 

-5882,36 

8 

56 

-5817,57 

-5612,98 

64 

-5763,63 

-5511,39 

48 

-5989,59 

-5868,69 

48 

-5954,45 

-5865,29 

9 

63 

-5764,29 

-5509,13 

72 

-5702,82 

-5482,30 

54 

-5931,64 

-5814,95 

54 

-5911,52 

-5804,07 

10 

70 

-5702,46 

-5494,73 

80 

-5644,46 

-5460,11 

60 

-5846,39 

-5741,61 

60 

-5865,69 

-5766,89 

11 

77 

-5654,33 

-5441,22 

88 

-5601,65 

-5435,63 

66 

-5802,84 

-5689,48 

66 

-5817,09 

-5713,91 

12 

84 

-5619,46 

-5410,99 

96 

-5599,81 

-5448,04 

72 

-5752,16 

-5636,58 

72 

-5797,22 

-5664,69 

13 

91 

-5598,93 

-5430,61 

104 

-5566,99 

-5423,52 

78 

-5725,24 

-5609,44 

78 

-5757,77 

-5623,56 

14 

98 

-5558,18 

-5384,77 

112 

-5553,93 

-5420,68 

84 

-5682,13 

-5542,43 

84 

-5720,74 

-5563,87 

15 

105 

-5538,44 

-5392,29 

120 

-5547,13 

-5431,19 

90 

-5651,20 

-5554,23 

90 

-5683,41 

-5555,02 


Table 1: Comparison of the afCEC, CEC and GMM Chinese and Latin characters. 


and 14 for GMM and CEC. The comparison of algorithms by using BIC and 
AIC with similar values of Log-likelihood function we present in Tab. 

Algorithms which are able to adapt to curve type structures (AcaCMM, 
afCEC ) better fit data. More precisely, the Log-likelihood function takes a 
larger value with the same number of parameters, see Eig. |9(f)| and Eig. 10(f)[ 
Since Log-likelihood increases with growing of the number of classes, we use 
BIC criterion which takes into account the number of parameters. In the 
case of AcaCMM and afCEC, we obtain optimal value of BIC after about 4-6 
iterations. In conclusion, AcaCMM and afCEC better fit data (yield a higher 
value of Log-likelihood function) while require lower number of parameters. 

Algorithms AcaGMM and afCEC give a comparable value of Log-likelihood, 
see Fig. 9(e) and Fig. |10(e) Nevertheless, afCEC uses less parameters, see 


Fig. 9(f) and Fig. 10(f)[ Moreover, strong theoretical background of the 


method guarantees that the cost function decreases in each iteration. Con¬ 
sequently, we obtain a simple stop condition for our method. 

Chinese characters mainly consist of straight-line strokes (horizontal, ver¬ 
tical) and curve strokes (slash, backslash and many types of hooks). GMM 
has already been employed for structure analysis of Chinese characters, and 
achieves commendable performance [32]. However, some lines extracted by 
GMM may be too short and is quite difficult to join these short lines to form 
semantic strokes due to the ambiguity of joining. This problem becomes 
more serious when analyzing handwritten characters by GMM, and this was 
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Algorithms 

Number of 
clusters 

Number of 
parameters 

Log-likelihood 

BIG 

AIC 

afCEC 

9 

9-7=63 

-5508.83 

11452.85 

11143.66 

AcaGMM 

9 

9-8=72 

-5497.11 

11491.58 

11138.22 

GMM 

14 

14-6=84 

-5520.96 

11622.17 

11209.92 

CEC 

14 

14-6=84 

-5510.09 

11600.44 

11188.18 


Table 2; Comparison of afCEC, AcaGMM, CEC and GMM in the case of spiral-type set, 
see Fig. [T^ 



NP 

afCEC 

LL 

BIC 

NP 

AcaGMM 

LL BIC 

NP 

GMM 

LL 

BIC 

NP 

CEC 

LL 

BIC 

it 

5 

1148.57 

-2071.93 

5 

1030.02 

-1802.66 

7 

1060.26 

-1850.18 

7 

1015.29 

-1760.33 


5 

1000.78 

-1770.42 

5 

959.71 

-1655.26 

7 

1170.85 

-2064.33 

7 

1175.96 

-2074.55 


4 

1009.02 

-1836.69 

4 

880.32 

-1553.38 

5 

824.51 

-1454.71 

5 

811.76 

-1429.21 


4 

1009.02 

-1836.69 

4 

880.32 

-1553.38 

6 

1027.55 

-1821.93 

6 

1032.92 

-1832.67 

f/l 

6 

1329.01 

-2372.74 

6 

1272.74 

-2219.45 

8 

1364.94 

-2403.85 

8 

1422.27 

-2518.51 


4 

1045.53 

-1911.53 

4 

921.65 

-1638.12 

5 

900.25 

-1608.15 

5 

902.12 

-1611.89 


4 

1045.53 

-1911.53 

4 

921.65 

-1638.12 

6 

1017.13 

-1803.44 

6 

1018.31 

-1805.79 


5 

1011.27 

-1794.47 

5 

962.93 

-1665.21 

7 

1079.99 

-1840.69 

7 

1181.03 

-2042.77 

b 

3 

2660.87 

-5158.99 

3 

2738.24 

-5290.49 

4 

2686.59 

-5187.19 

4 

2678.49 

-5170.99 

R 

3 

1911.73 

-3652.67 

3 

1578.61 

-2962.04 

4 

1996.56 

-3797.94 

4 

1989.31 

-3783.43 

s 

3 

1883.88 

-3604.71 

3 

1907.83 

-3629.32 

4 

1875.93 

-3565.52 

4 

1866.01 

-3545.68 


Table 3: Comparison of the afCEC, AcaGMM, CEC and GMM methods for Chinese and 
Latin characters. 


the motivation to use AcaGMM to represent Chinese characters. In Tab. 
we present a comparison of afCEC, AcaGMM, GMM and CEC on Chinese 
and Latin characters: it (dog), (beg), (father), {TL (mother), ii (fire), 
(master), b, R, S. The number of clusters has been determined so as to 
obtain a similar value of Log-likelihood function. 

5. Appendix—AcaGMM Gaussian model 

As it was previously mentioned, AcaGMM does not use densities. More 
precisely, the Jacobian of the transformation was not taken into considera¬ 
tion. However, the EM procedure, which was used in AcaGMM, works with 
probability distributions. Therefore, from the theoretical point of view the 
above procedure is incorrect. Moreover, if we want to compare our method 
by using of the Log-likelihood function we need densities. 

Let us start from numerical integration of the original AcaGMM function 
and of the model rescaled by Jacobian correction. The Simpson method |45j . 
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on the square [—5, 5] x [—5, 5] with 50000 segments was used. The integral 
in the case of AcaGMM is equal to 1.038. After correction we obtain 1 (with 
a precision of 10^). 

Let us consider situation of the AcaGMM model. Suppose X and Y are 
zero mean independent Gaussian distributions with variances cri,(T 2 : 


NxY{x,y) 


'J 27ro'iO'2 


exp 


2cri(T2 J ■ 


Moreover, let 


Z = g(X,V), W = h(X,V), 


where g,h & Let J{x,y) represent the Jacobian of the original 


transformation 


J{x,y) 


det 


9g(x,y) 

dx 

dh{x,y) 

dx 


dg{x,y) ~ 

dy 

dh(x,y) 
dy _ 


In such a case, we have 




E 

{(a:,j/)eR2: (g(a;,j/),/i.(a;,j/))=(z,-!u)} 


NxY{x,y) 

\J{x,y)\ 


Let us consider the function / expressed as parametric equation / := 
{{x{t),y{t)): t G E} (in the case of AcaGMM it is a parabola). Using the 
formula from Table 1.] we obtain the orthogonal projection (x(to), l/(^o)) 
of point {pi,P 2 ) on curve /: 


to=Pf{Pi,P2) 


Vr + Vd+^r-^/d 

0 , 0 , 0 

2\/—Q, —\/—Q, —\/—Q 

cos(^)A = 0,1,2, 

where 6 = acos ( ^ — 1 

Vv^y 


> 0 

L> = 0,Q = i? = 0 
D <0 


where Q = R = ^ and D = + R\ 

On the other hand, the arc length of / between zero and (x(to),|/(^o)) 
m Formula (10)] is given by 


K^o) = 


2 yol'y/l + + — In ^2|a|to + 
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Consequently, we have 


9 ^{Pi,P2) = Ux{to),y{to)) - {pi,P2)\\, 


where to = Pf{pi,P 2 )- 


h ^iPi,P2) = /(to), 


{h{x,y),g{x,y)) 

- y 




Figure 11: The transformation used in AcaGMM. 


Our goal is to determine the Jacobian of our transformation, see Fig. 

Let us consider an arbitrary small neighborhood of (x(to),2/(^o))- In such a 
case, the local curvature of / at (x(to), |/(^o)) is the same as the curvature of 
the osculating circl^at (a:(to),|/(^o))- 

The radius of curvature in the case of parametric form of curve is given 

by 

{x''^ + 

r =-. 

x'y" — y'x” 


Consequently, our goal is to determinate how a set is changing under the 


influence of the transformation, see Fig. 12 


®In differential geometry of curves, the osculating circle of a sufficiently smooth plane 
curve at a given point p on the curve has been traditionally defined as the circle passing 
through p and a pair of additional points on the curve infinitesimally close to p. Its center 
lies on the inner normal line, and its curvature is the same as that of the given curve at 
that point. This circle, which is the one among all tangent circles at the given point that 
approaches the curve most tightly, was named circulus osculans (Latin for “kissing circle”) 
by Leibniz. 
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A small square neighborhood of the point (pi,P 2 ) is mapped to a trapezoid 
(asymptotically when a size of square converges to zero). This operation is 
showed in Fig. It is easy to see that the square area changes linearly 
depending on the distance p. If we consider the situation where p = r, we 
obtain that our square is collapsed to a point. Consequently, for points above 
the curve Jacobian is asymptotically proportional to 

r — p ^ p 

r r 

In a natural way, if a point {pi,P 2 ) is under the curve, the square area is 
increasing under the influence of the transformation. Therefore, the Jacobian 
is asymptotically equal to 

r + p P 


r 


(pi 


P 


P2) 



Figure 12: Transformation of a square neighborhood of a point under the influence 

of the AcaGMM function. 

Now we have the formula for the Jacobian of AcaGMM transformation, 
but it depends on the relation between a point and its orthogonal projec¬ 
tion. More precisely, we have to verify which formula should be used (or 
equivalently on which side of parabola a point is found), see Fig. 

We can easily verify where the point (pi, P 2 ) is in relation to the orthogonal 
projection (x(to), |/(^o)) by checking the orientation of a basis containing the 
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Figure 13: Position of the point and its orthogonal projection on the parabola f{x) = x^. 
The distance between point and his orthogonal projection, when it is situated above the 
curve is marked by a solid line. On the other hand, if the relationship is reversed, we mark 
the projection by a dashed line. 


normal vector (^ 1 ,^ 2 ) — (^(^o), ?/(^o)) and the tangent vector {x'{to)^ y'{to)) 
at a point (x(to),?/(^o))- Consequently, we have to verify the sign of the 
determinant 


det 


Pi - x(to) x'ito) \ 

P2-yito) 
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